A Fast Clustering System for a Huge Number of Nucleotide Sequences
Abstract
Single-pass sequences of mRNA, called expressed sequence tags (ESTs), have been determined extensively and accumulated in the dbEST database in GenBank. By August 2002, the number of ESTs in dbEST exceeded eight million. Clustering and assembling ESTs enables several analyses. First, because ESTs are fragments of mRNA and do not include complete open reading frames (ORFs), assembling them lets us reconstruct complete ORF sequences. Second, we can examine gene expression profiles by counting, for each cluster, the number of ESTs obtained from each library. Third, clustering reduces the redundancy of EST sequences. Finally, we can select ESTs that represent each cluster. In the full-length human cDNA sequencing project supported by the New Energy and Industrial Technology Development Organization (NEDO) in Japan [2], full-length sequences were determined for clones of representative ESTs chosen from clusters of unknown genes. To select clones for full-length sequencing, more than one million EST sequences were determined and clustered in this project, and 21,243 human full-length cDNA sequences obtained from it have now been published in DDBJ. Clustering and assembling huge numbers of ESTs is therefore indispensable for EST analysis. However, determining the relationships among millions of sequences for clustering is computationally hard. Some currently available tools need more than a few days to cluster that many sequences, and others cannot complete the task at all. In this paper, we present an algorithm that completes the entire clustering and assembly process in O(N log N) time, where N is the number of sequences. We also implemented this algorithm and succeeded in assembling more than a million sequences within half a day.
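The abstract does not describe the algorithm itself, so the following Python sketch is only a generic illustration and not the authors' method. It assumes that two ESTs belong to the same cluster when they share at least one exact k-mer, and it ignores reverse complements and sequencing errors. Sorting the (k-mer, sequence) pairs places sequences with a common k-mer next to each other, which shows how sorting can replace all-pairs comparison. The second function shows the per-library counting used for expression profiles. All names here (`cluster_by_shared_kmers`, `expression_profile`, the parameter `k`, and the sample data) are hypothetical.

```python
from collections import Counter, defaultdict


def find(parent, i):
    """Return the root of i, shortening the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_by_shared_kmers(seqs, k=8):
    """Group sequences that share at least one exact k-mer (assumed rule).

    Sorting (k-mer, sequence index) pairs makes equal k-mers adjacent,
    so only neighbouring pairs in the sorted list need to be merged.
    """
    pairs = sorted(
        (s[i:i + k], idx)
        for idx, s in enumerate(seqs)
        for i in range(len(s) - k + 1)
    )
    parent = list(range(len(seqs)))
    for (kmer_a, a), (kmer_b, b) in zip(pairs, pairs[1:]):
        if kmer_a == kmer_b:
            parent[find(parent, a)] = find(parent, b)
    # One cluster label (the root index) per input sequence.
    return [find(parent, i) for i in range(len(seqs))]


def expression_profile(labels, libraries):
    """Count, for each cluster, how many ESTs came from each library."""
    profile = defaultdict(Counter)
    for label, lib in zip(labels, libraries):
        profile[label][lib] += 1
    return profile


# Toy data, invented for illustration only.
ests = ["ACGTACGTGGA", "GTACGTGGATT", "TTTTCCCCAAAA", "CCCCAAAAGG"]
libraries = ["liver", "brain", "liver", "liver"]

labels = cluster_by_shared_kmers(ests, k=8)
profile = expression_profile(labels, libraries)
for label, counts in profile.items():
    print(label, dict(counts))
```

The cost of this sketch is dominated by sorting the k-mer pairs, so it grows as M log M in the total number of k-mers M. A real EST pipeline would also need overlap verification and error tolerance, which are omitted here.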
Publication date: 2002